The provided genome measures 32803248nt, of which 13608910nt were predicted as simple or transposable repetitive elements (41.49% of the genome).
The genome sequence file with the repetitive positions lower-cased is available from here.
The genomic coordinates of the repetitive elements are available from here.
Sequencing reads were aligned to the provided genome using BWA[1].
Table 1 reports the sequencing read alignment statistics computed with Picard[2]
See https://broadinstitute.github.io/picard/picard-metric-definitions.html#AlignmentSummaryMetrics for metrics definition
| CATEGORY | TOTAL_READS | PF_READS_ALIGNED | PCT_PF_READS_ALIGNED | PF_ALIGNED_BASES | PF_MISMATCH_RATE | PF_INDEL_RATE | MEAN_READ_LENGTH | READS_ALIGNED_IN_PAIRS | PCT_READS_ALIGNED_IN_PAIRS | PF_READS_IMPROPER_PAIRS | PCT_PF_READS_IMPROPER_PAIRS | STRAND_BALANCE | PCT_CHIMERAS | PCT_ADAPTER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FIRST_OF_PAIR | 14641170 | 14347825 | 0.979964 | 1543165315 | 0.001961 | 0.000300 | 108 | 14309341 | 0.997318 | 101001 | 0.007039 | 0.500309 | 0.001614 | 0.002372 |
| SECOND_OF_PAIR | 14663651 | 14314351 | 0.976179 | 1532962669 | 0.003626 | 0.000294 | 108 | 14309341 | 0.999650 | 67527 | 0.004717 | 0.499734 | 0.001531 | 0.002351 |
| PAIR | 29304821 | 28662176 | 0.978070 | 3076127984 | 0.002791 | 0.000297 | 108 | 28618682 | 0.998483 | 168528 | 0.005880 | 0.500021 | 0.001573 | 0.002361 |
Figure 1 demonstrates the insert size distributions for read pairs with Forward-Reverse (FR), Reverse-Forward (RF) and TANDEM orientation.
The statistics relative to the pair-end insert size were computed with Picard and reported in Table 2
See https://broadinstitute.github.io/picard/picard-metric-definitions.html#InsertSizeMetrics for metrics definition.
| MEDIAN_INSERT_SIZE | MODE_INSERT_SIZE | MEDIAN_ABSOLUTE_DEVIATION | MIN_INSERT_SIZE | MAX_INSERT_SIZE | MEAN_INSERT_SIZE | STANDARD_DEVIATION | READ_PAIRS | WIDTH_OF_10_PERCENT | WIDTH_OF_20_PERCENT | WIDTH_OF_30_PERCENT | WIDTH_OF_40_PERCENT | WIDTH_OF_50_PERCENT | WIDTH_OF_60_PERCENT | WIDTH_OF_70_PERCENT | WIDTH_OF_80_PERCENT | WIDTH_OF_90_PERCENT | WIDTH_OF_95_PERCENT | WIDTH_OF_99_PERCENT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FR | 509 | 473 | 81 | 2 | 1819156 | 524.1213 | 133.8351 | 14259259 | 31 | 61 | 93 | 127 | 163 | 203 | 253 | 319 | 433 | 553 | 833 |
| RF | 3954 | 3828 | 3493 | 1 | 2504221 | 8245.2586 | 9374.6391 | 9439 | 351 | 667 | 1457 | 4721 | 6987 | 11423 | 21795 | 45555 | 96961 | 148349 | 490505 |
| TANDEM | 74940 | 38 | 66820 | 1 | 2245262 | 136546.5588 | 174137.2072 | 2301 | 14741 | 41645 | 79177 | 107561 | 133641 | 148691 | 430591 | 479553 | 1032863 | 1984679 | 2905381 |
Duplicated reads (i.e. originating from a single fragment of DNA) were detected with Picard.
Table 3 reports read duplicates statistics.
See https://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics for metrics definition
| UNPAIRED_READS_EXAMINED | READ_PAIRS_EXAMINED | SECONDARY_OR_SUPPLEMENTARY_RDS | UNMAPPED_READS | UNPAIRED_READ_DUPLICATES | READ_PAIR_DUPLICATES | READ_PAIR_OPTICAL_DUPLICATES | PERCENT_DUPLICATION | ESTIMATED_LIBRARY_SIZE |
|---|---|---|---|---|---|---|---|---|
| 77885 | 14513542 | 33143 | 642645 | 34391 | 204201 | 0 | 0.015214 | 510924154 |
The saturation curve in Figure 2 shows the return of interest (ROI) if additional sequencing data was produced from the same library.
For sequencing at higher depth the estimated gains in library complexity (i.e. the number of unique molecules in the sequencing library) does not grow linearly since more and more of the reads are duplicates.
The left most dot indicates the current ROI.
See https://broadinstitute.github.io/picard/faq.html for more info regarding the histogram produced by MarkDuplicates.
Table 4 reports the median sequencing coverage (MEDIANCOV) for individual chromosomes.
MEDIANCOVminus2MAD and MEDIANCOVplus2MAD reflect the sequencing coverage variability, indicating the median sequencing coverage -/+ 2 mean absolute deviations respectively.
| CHR | MEDIANCOV | MEDIANCOVminus2MAD | MEDIANCOVplus2MAD |
|---|---|---|---|
| LinJ01 | 83 | 65 | 101 |
| LinJ02 | 92 | 66 | 118 |
| LinJ03 | 78 | 62 | 94 |
| LinJ04 | 88 | 70 | 106 |
| LinJ05 | 130 | 106 | 154 |
| LinJ06 | 113 | 93 | 133 |
| LinJ07 | 89 | 71 | 107 |
| LinJ08 | 88 | 66 | 110 |
| LinJ09 | 116 | 94 | 138 |
| LinJ10 | 87 | 67 | 107 |
| LinJ11 | 88 | 68 | 108 |
| LinJ12 | 81 | 61 | 101 |
| LinJ13 | 80 | 62 | 98 |
| LinJ14 | 76 | 58 | 94 |
| LinJ15 | 92 | 72 | 112 |
| LinJ16 | 94 | 76 | 112 |
| LinJ17 | 79 | 61 | 97 |
| LinJ18 | 80 | 62 | 98 |
| LinJ19 | 77 | 59 | 95 |
| LinJ20 | 123 | 101 | 145 |
| LinJ21 | 79 | 63 | 95 |
| LinJ22 | 77 | 55 | 99 |
| LinJ23 | 95 | 75 | 115 |
| LinJ24 | 79 | 63 | 95 |
| LinJ25 | 80 | 64 | 96 |
| LinJ26 | 97 | 79 | 115 |
| LinJ27 | 77 | 57 | 97 |
| LinJ28 | 79 | 63 | 95 |
| LinJ29 | 79 | 61 | 97 |
| LinJ30 | 81 | 65 | 97 |
| LinJ31 | 177 | 143 | 211 |
| LinJ32 | 81 | 65 | 97 |
| LinJ33 | 80 | 62 | 98 |
| LinJ34 | 77 | 59 | 95 |
| LinJ35 | 90 | 72 | 108 |
| LinJ36 | 79 | 63 | 95 |
The genome was divided in contiguous bins spanning 300 nucleotides. For each bin GIP:
The median of the genomic bin intervals is 84.9967.
To calculate the somy score, the mean sequencing coverage of each genomic bin with mean MAPQ score > 50 was normalized by the median of all bins and multiplied by two.
Figure 3 and Figure 4 are two alternative somy score distribution visualizations.
The genome sequencing coverage density is available in bigWig format from here.
BigWig file format is compatible with genome browsers such as IGV[3].
A description of the bigWig format is available from https://genome.ucsc.edu/goldenPath/help/bigWig.html
Genomic bins with mean MAPQ score > 50 were evaluated to identify significantly amplified or depleted regions.
Considering an adjusted p-value threshold of 0.001 and considering possible user-defined additional coverage cut-offs, a total of 351 genomic bins were deemed to be either significantly enriched or depleted.
Significant bins can be found here
Collapsing significant adjacent bins into larger regions (segments) and averaging the bin coverage resulted in 42 amplified regions and 3 depleted regions available from here
| chr | start | end | score | segmentID |
|---|---|---|---|---|
| LinJ27 | 1173601 | 1175100 | 12.06 | segment_30 |
| LinJ31 | 149101 | 153900 | 6.21 | segment_36 |
| LinJ22 | 775801 | 776400 | 5.31 | segment_26 |
| LinJ12 | 417901 | 418800 | 3.99 | segment_18 |
| LinJ22 | 780601 | 781200 | 3.98 | segment_27 |
| chr | start | end | score | segmentID |
|---|---|---|---|---|
| LinJ34 | 825001 | 825900 | 0.47 | segment_39 |
| LinJ12 | 378901 | 379200 | 0.49 | segment_14 |
| LinJ12 | 462601 | 462900 | 0.49 | segment_22 |
The gene sequencing coverage was evaluated by
The MAPQ score of the reads aligning to each gene was also evaluated. Genes with mean MAPQ score > 50 were evaluated to identify significantly amplified or depleted genes. A total of 25 genes resulted significantly amplified, and a total of 0 significantly depleted
Figure 7: Overview of chromosomes embedding significant copy number variant genes
| gene_id | locus | meanCoverage | normalizedMeanCoverage | MAPQ | annotation |
|---|---|---|---|---|---|
| LINF_310009800 | LinJ31:152089-152688 | 1155.570 | 6.528644 | 56.68084 | Amastin_surface_glycoprotein_-_putative |
| LINF_340038500 | LinJ34:1411292-1413058 | 263.310 | 3.419610 | 59.97573 | hypothetical_protein_-_conserved |
| LINF_120014700 | LinJ12:452302-453693 | 213.508 | 2.635901 | 54.55921 | surface_antigen_protein_2_-_putative |
| LINF_080017600 | LinJ08:505615-508179 | 229.335 | 2.606079 | 59.99312 | protein_kinase_-_putative |
| LINF_080017300 | LinJ08:497982-498281 | 225.482 | 2.562295 | 59.99312 | hypothetical_protein_-_conserved |
The table with the sequencing coverage of significantly amplified, depleted or non-varying genes can be found here.
Single nucleotide variant (SNV) analysis include
freebayes[4]snpEff[5]| metric | count |
|---|---|
| total SNVs | 25158 |
| intergenic | 15032 |
| genic | 10126 |
| tot genes with SNVs | 3587 |
| mean number of SNVs per gene | 2.82297184276554 |
| genes with dN/dS ratio > 1 | 1681 |
| genes with dN/dS ratio < 1 | 1382 |
| genes with dN/dS ratio == 1 | 524 |
| tot synonimous variations | 4691 |
| tot non-synonimous variations | 5435 |
| chr | position | freq | AO | totDepth | ref_alt | context | EFF | gene_annotation |
|---|---|---|---|---|---|---|---|---|
| LinJ31 | 146527 | 1 | 196 | 196 | C_T | ACTTGCTCGAC | NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aGc/aAc|S483N|EXON_LinJ31_145800_147974|||LINF_310009700|1) | cytoskeleton-associated_protein_CAP5.5_-_putative |
| LinJ31 | 130288 | 1 | 182 | 182 | C_A | AAAAACCCCGA | NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gGt/gTt|G206V|EXON_LinJ31_128094_130904|||LINF_310009300|1) | calpain-like_protein_-_putative |
| LinJ31 | 146398 | 1 | 177 | 177 | T_C | AGTACTGGGTG | NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|cAg/cGg|Q526R|EXON_LinJ31_145800_147974|||LINF_310009700|1) | cytoskeleton-associated_protein_CAP5.5_-_putative |
| LinJ31 | 147804 | 1 | 175 | 175 | A_T | ATGGCATCAAT | NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gaT/gaA|D57E|EXON_LinJ31_145800_147974|||LINF_310009700|1) | cytoskeleton-associated_protein_CAP5.5_-_putative |
| LinJ31 | 209222 | 1 | 172 | 172 | G_A | GGAGAGCACGG | NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Ctc/Ttc|L157F|EXON_LinJ31_208161_209690|||LINF_310011400|1) | amino_acid_transporter_aATP11_-_putative |
The full set of SNVs predictions in VCF format is available from here. The filtered set of SNVs predictions in VCF format is available from here.
A tabulated version of the filtered set is available from here.
| gene | dN | dS | ratio | difference | annotation |
|---|---|---|---|---|---|
| LINF_080008400 | 24 | 5 | 4.800 | 19 | hypothetical_protein_-_conserved |
| LINF_160018200 | 18 | 4 | 4.500 | 14 | hypothetical_protein_-_conserved |
| LINF_310035400 | 15 | 2 | 7.500 | 13 | phosphatidylinositol-4-phosphate_5-kinase-like_protein |
| LINF_170010700 | 14 | 3 | 4.667 | 11 | hypothetical_protein_-_conserved |
| LINF_300017200 | 11 | 1 | 11.000 | 10 | hypothetical_protein_-_conserved |
The table including the dNdS analysis of the filtered set is available from here.
delly[6] was used to detect structural variants (SVs) based of read pairing orientation and split reads information.
The genomic SVs that were evaluated include
The file in VCF format with delly's predictions is available from here.
Downstream filters were applied on the delly's output to reduce the number of false positives.
The filtered output is available from here.
[1] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi:10.1093/bioinformatics/btp324. Epub 2009 May 18. PMID: 19451168; PMCID: PMC2705234.
[2] "Picard Toolkit". Broad Institute, http://broadinstitute.github.io/picard.
[3] Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G,Mesirov JP. Integrative genomics viewer. Nat Biotechnol. 2011 Jan;29(1):24-6. doi: 10.1038/nbt.1754. PMID: 21221095; PMCID: PMC3346182.
[4] Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012
[5] "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.", Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
[6] Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY:structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012 Sep 15;28(18):i333-i339. doi:10.1093/bioinformatics/bts378. PMID: 22962449; PMCID: PMC3436805.
GIP github page: https://github.com/giovannibussotti/GIP
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.7 LTS
##
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.29
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.0 magrittr_1.5 tools_3.6.0 htmltools_0.5.0
## [5] yaml_2.2.1 stringi_1.4.6 rmarkdown_2.3 highr_0.8
## [9] stringr_1.4.0 xfun_0.16 digest_0.6.25 mime_0.9
## [13] rlang_0.4.7 evaluate_0.14